
    Observations on Multi-Band Asynchrony in Distant Speech Recordings

    Whenever the speech signal is captured by a microphone distant from the user, the acoustic response of the room introduces significant distortions. Solutions exist that remove these distortions from the signal and greatly improve ASR performance (what was said?), such as dereverberation or beamforming. It may seem natural to apply those signal-level methods in the context of speaker clustering (who spoke when?) with distant microphones, for example when annotating a meeting recording for an enhanced browsing experience. Unfortunately, on a corpus of real meeting recordings, neither dereverberation nor beamforming gave any improvement on the speaker clustering task. The present technical report constitutes a first attempt to explain this failure, through a cross-correlation analysis between close-talking and distant microphone signals. The various frequency bands of the speech spectrum appear to become desynchronized when the speaker is 1 or 2 meters away from the microphone. Further directions of research are suggested to model this desynchronization.
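    The band-wise analysis described above can be sketched as follows: split both the close-talking and the distant signal into frequency bands, then estimate the best-matching lag in each band by cross-correlation. This is a minimal illustration on synthetic data; the filter, band edges, and signals are placeholders, not the report's actual setup.

```python
import numpy as np

def bandpass(x, fs, lo, hi):
    # crude FFT-domain band filter (illustration only)
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

def band_lag(close, distant, fs, lo, hi):
    # lag (in samples) by which the distant band trails the close-talking band
    a = bandpass(close, fs, lo, hi)
    b = bandpass(distant, fs, lo, hi)
    corr = np.correlate(b, a, mode="full")
    return int(np.argmax(corr)) - (len(a) - 1)

fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)    # 1 s of speech-like broadband noise
dist = np.roll(x, 40)          # distant copy, uniformly delayed by 40 samples
lags = [band_lag(x, dist, fs, lo, lo + 1000) for lo in (300, 1300, 2300)]
print(lags)
```

    With a single uniform delay the per-band lags agree; the report's observation is precisely that on real distant recordings they do not.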

    Spatio-Temporal Analysis of Spontaneous Speech with Microphone Arrays

    Accurate detection, localization and tracking of multiple moving speakers permits a wide spectrum of applications. Techniques are required that are versatile, robust to environmental variations, and not constraining for non-technical end-users. Based on distant recordings of spontaneous multiparty conversations, this thesis focuses on the use of microphone arrays to address the question "Who spoke where and when?". The speed, versatility and robustness of the proposed techniques are tested on a variety of real indoor recordings, including multiple moving speakers as well as seated speakers in meetings. Optimized implementations are provided in most cases. We propose to discretize the physical space into a few sectors, and for each time frame, to determine which sectors contain active acoustic sources (Where? When?). A topological interpretation of beamforming is proposed, which permits both the evaluation of the average acoustic energy in a sector at negligible cost, and the precise localization of a speaker within an active sector. One additional contribution that goes beyond the field of microphone arrays is a generic, automatic threshold selection method, which does not require any training data. On the speaker detection task, the new approach is dramatically superior to the more classical approach where a threshold is set on training data. We integrate the new approach into a system for multispeaker detection-localization. Another generic contribution is a principled, threshold-free framework for short-term clustering of multispeaker location estimates, which also permits the detection of where and when multiple trajectories intersect. On multi-party meeting recordings, using distant microphones only, short-term clustering yields a speaker segmentation performance similar to that of close-talking microphones.
    The resulting short speech segments are then grouped into speaker clusters (Who?), through an extension of the Bayesian Information Criterion to merge multiple modalities. On meeting recordings, the speaker clustering performance is significantly improved by merging the classical mel-cepstrum information with the short-term speaker location information. Finally, a close analysis of the speaker clustering results suggests that future research should investigate the effect of human acoustic radiation characteristics on the overall transmission channel, when a speaker is a few meters away from a microphone.
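    For the clustering step, the classical single-Gaussian Delta-BIC merge criterion that the thesis extends can be sketched as follows. This is the standard formulation, not the multi-modal extension itself; `lam` is the usual penalty-weight tuning factor.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    # Classical Delta-BIC between "one Gaussian for the merged data" and
    # "one Gaussian per cluster"; negative values favor merging x1 and x2.
    x = np.vstack([x1, x2])
    n, d = x.shape

    def nlogdet(z):
        # n * log |covariance|, regularized for numerical safety
        cov = np.cov(z, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return len(z) * np.linalg.slogdet(cov)[1]

    # model-complexity penalty: mean + symmetric covariance parameters
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (nlogdet(x) - nlogdet(x1) - nlogdet(x2)) - lam * penalty

rng = np.random.default_rng(1)
same = delta_bic(rng.standard_normal((500, 2)), rng.standard_normal((500, 2)))
apart = delta_bic(rng.standard_normal((500, 2)),
                  rng.standard_normal((500, 2)) + 8.0)
```

    Two samples from the same distribution yield a negative score (merge); well-separated clusters yield a positive one (keep apart).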

    A Sector-Based, Frequency-Domain Approach to Detection and Localization of Multiple Speakers

    Detection and localization of speakers with microphone arrays is a difficult task due to the wideband nature of speech signals, the large amount of overlap between speakers in spontaneous conversations, and the presence of noise sources. Many existing audio multi-source localization methods rely on prior knowledge of the sectors containing active sources and/or the number of active sources. This paper proposes sector-based, frequency-domain approaches that address both detection and localization problems by measuring relative phases between microphones. The first approach is similar to delay-sum beamforming. The second approach is novel: it relies on systematic optimization of a centroid in phase space, for each sector. It provides a major, systematic improvement over the first approach as well as over previous work. Very good results are obtained on more than one hour of recordings in real meeting room conditions, including cases with up to 3 concurrent speakers.
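    A minimal sketch of the first, delay-sum-like idea: for one microphone pair, phase-align the second channel toward each candidate direction in the frequency domain and keep the direction with the highest output energy. The geometry, angle grid, and signals are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def steered_power(x1, x2, fs, d, c=343.0, angles=np.linspace(-90, 90, 37)):
    # frequency-domain delay-and-sum for one microphone pair:
    # phase-align mic 2 toward each candidate angle and measure output energy
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    f = np.fft.rfftfreq(len(x1), 1.0 / fs)
    powers = []
    for a in angles:
        tau = d * np.sin(np.radians(a)) / c          # inter-mic delay at angle a
        y = X1 + X2 * np.exp(2j * np.pi * f * tau)   # align and sum
        powers.append(np.sum(np.abs(y) ** 2))
    return angles[int(np.argmax(powers))]

# synthetic check: a broadband source at 30 degrees, mics 0.2 m apart
fs, d_mic = 16000, 0.2
rng = np.random.default_rng(3)
x1 = rng.standard_normal(fs)
tau_true = d_mic * np.sin(np.radians(30.0)) / 343.0
f = np.fft.rfftfreq(fs, 1.0 / fs)
x2 = np.fft.irfft(np.fft.rfft(x1) * np.exp(-2j * np.pi * f * tau_true), fs)
est = steered_power(x1, x2, fs, d_mic)
```

    The paper's sector-based methods evaluate such energies per sector rather than per point, and its second approach replaces this energy measure with a phase-space centroid fit.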

    Location Based Speaker Segmentation

    This paper proposes a technique that segments audio into speaker turns based on speaker location, essentially implementing a discrete source tracking system. In many multi-party conversations, such as meetings or teleconferences, the location of participants is restricted to a small number of regions, such as seats around a table. In such cases, segmentation according to these discrete regions would be a reliable means of determining speaker turns. We propose a system that uses microphone pair time delays as features to represent speaker locations. A GMM/HMM framework is used to determine an optimal segmentation of the audio according to these locations. We also demonstrate how this approach is easily extended to more complex cases, such as the presence of two simultaneous speakers. Experiments testing the system on real recordings from a meeting room show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of the extension to handle dual-speaker overlap.
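    Microphone-pair time delays of the kind used as features here are commonly estimated with GCC-PHAT; a minimal single-pair estimator might look like this. GCC-PHAT is a standard choice, and the paper's exact delay estimator may differ.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    # time delay of x2 relative to x1 via GCC-PHAT: whiten the cross-spectrum
    # so that only phase (i.e. delay) information remains, then pick the peak
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_shift = n // 2 if max_tau is None else int(max_tau * fs)
    cc = np.concatenate([cc[-max_shift:], cc[:max_shift + 1]])
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # seconds; positive if x2 lags

fs = 16000
rng = np.random.default_rng(4)
x = rng.standard_normal(4096)
tau = gcc_phat_tdoa(x, np.roll(x, 8), fs)   # second channel delayed by 8 samples
```

    A vector of such delays, one per microphone pair, forms the location feature modeled by the GMM/HMM.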

    A Sector-Based Approach for Localization of Multiple Speakers with Microphone Arrays

    Microphone arrays are useful in meeting rooms, where speech needs to be acquired and segmented. For example, automatic speech segmentation allows an enhanced browsing experience, and facilitates automatic analysis of large amounts of data. Spontaneous multi-party speech includes many overlaps between speakers; moreover, other audio sources such as laptops and projectors can be active. For these reasons, locating multiple wideband sources in a reasonable amount of time is highly desirable. In existing multisource localization approaches, search initialization is very often an issue left open. We propose here a methodology for estimating speech activity in a given sector of the space rather than at a particular point. In experiments on more than one hour of speech from real meeting room multisource recordings, we show that the sector-based approach greatly reduces the search space. At the same time, it achieves effective localization of multiple concurrent speakers.

    Threshold Selection for Unsupervised Detection, with an Application to Microphone Arrays

    Detection is usually done by comparing some criterion to a threshold. It is often desirable to keep a performance metric such as the False Alarm Rate (FAR) constant across conditions. Using training data to select the threshold may lead to suboptimal results on test data recorded in different conditions. This paper investigates unsupervised approaches, where no training data is used. A probabilistic model is fitted on the test data using the EM algorithm, and the threshold value is selected based on the model. The proposed approach (1) does not use training data, (2) uses the test data itself to compensate for simplifications inherent to the model, and (3) permits the use of more complex models in a straightforward manner. On a microphone array speech detection task, the proposed unsupervised approach achieves results similar to or better than those of the ``training'' approach. The methodology is general and may be applied to contexts other than microphone arrays, and to performance metrics other than FAR.
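    The unsupervised idea can be sketched with a 1-D, 2-component Gaussian mixture fitted by EM on the test scores themselves, followed by a model-based threshold rule. Here the threshold is placed where the two components' posteriors cross; this equal-posterior rule is one possible model-based choice, not necessarily the paper's.

```python
import numpy as np

def fit_two_gaussians(x, iters=50):
    # EM for a 1-D, 2-component Gaussian mixture, fitted on the test data itself
    mu = np.array([x.min(), x.max()], float)
    sig = np.array([x.std(), x.std()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        lik = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / sig
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, standard deviations
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return w, mu, sig

def select_threshold(w, mu, sig):
    # threshold at the point where the high-mean ("active") component's
    # weighted density first exceeds the low-mean ("noise") component's
    grid = np.linspace(mu.min(), mu.max(), 1000)
    p = w * np.exp(-0.5 * ((grid[:, None] - mu) / sig) ** 2) / sig
    lo, hi = np.argsort(mu)
    return grid[np.argmax(p[:, hi] > p[:, lo])]

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 700),   # noise-like detection scores
                         rng.normal(5, 1, 300)])  # speech-like detection scores
w, mu, sig = fit_two_gaussians(scores)
thr = float(select_threshold(w, mu, sig))
```

    Once the mixture is fitted, other rules (e.g. fixing a model-predicted FAR on the noise component) follow from the same estimated parameters.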

    Sector-Based Detection for Hands-Free Speech Enhancement in Cars

    Speech-based command interfaces are becoming more and more common in cars. Applications include automatic dialog systems for hands-free phone calls as well as more advanced features such as navigation systems. However, interference, such as speech from the codriver, can severely hamper the performance of the speech recognition component, which is crucial for those applications. This issue can be addressed with {\em adaptive} interference cancellation techniques such as the Generalized Sidelobe Canceller~(GSC). In order to cancel the interference (codriver) while not cancelling the target (driver), adaptation must happen only when the interference is active and dominant. To that end, this paper proposes two efficient adaptation control methods, called ``implicit'' and ``explicit''. While the ``implicit'' method is fully automatic, the ``explicit'' method relies on pre-estimation of target and interference energies. A major contribution of this paper is a direct, robust method for such pre-estimation, derived from sector-based detection and localization techniques. Experiments on real in-car data validate both adaptation methods, including a case with 100 km/h background road noise.
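    Adaptation control of this kind can be illustrated with a gated NLMS canceller: a filter that removes the interference reference from the primary channel, updated only on samples flagged as interference-dominant. This is a generic sketch, not the paper's GSC structure or its detection method.

```python
import numpy as np

def gated_nlms(primary, reference, adapt_mask, order=8, mu=0.5, eps=1e-8):
    # NLMS interference canceller: subtract a filtered version of the
    # reference (interference) channel from the primary channel, updating
    # the filter coefficients only where adapt_mask is True
    w = np.zeros(order)
    buf = np.zeros(order)
    out = np.empty(len(primary))
    for t in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[t]
        e = primary[t] - w @ buf   # canceller output (residual)
        out[t] = e
        if adapt_mask[t]:
            w += mu * e * buf / (buf @ buf + eps)
    return out

# synthetic check: primary contains only a filtered interference leak,
# and the (idealized) detector flags interference as always dominant
rng = np.random.default_rng(5)
ref = rng.standard_normal(5000)
leak = np.convolve(ref, [0.5, 0.3, 0.1])[:5000]
out = gated_nlms(leak, ref, np.ones(5000, dtype=bool))
```

    In the real system the mask would come from the sector-based target/interference energy pre-estimates; freezing adaptation when the target dominates prevents target cancellation.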

    Clustering And Segmenting Speakers And Their Locations In Meetings

    This paper presents a new approach toward automatic annotation of meetings in terms of speaker identities and their locations. This is achieved by segmenting the audio recordings using two independent sources of information: magnitude spectrum analysis and sound source localization. We combine the two in an appropriate HMM framework. There are three main advantages to this approach. First, it is completely unsupervised, i.e. the speaker identities and the numbers of speakers and locations are automatically inferred. Second, it is threshold-free, i.e. decisions are made without the need for a threshold value, which would generally require an additional development dataset. Third, the joint segmentation improves over the speaker segmentation derived using only acoustic features. Experiments on a series of meetings recorded in the IDIAP Smart Meeting Room demonstrate the effectiveness of this approach.
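    Combining two independent feature streams in an HMM is often done by log-linear fusion of their per-state emission scores before Viterbi decoding. A minimal sketch, where the weight `alpha` and the uniform transition matrix are illustrative assumptions:

```python
import numpy as np

def combined_loglik(loglik_spectral, loglik_location, alpha=0.7):
    # weighted log-linear fusion of two independent feature streams
    return alpha * loglik_spectral + (1 - alpha) * loglik_location

def viterbi(emissions, log_trans):
    # standard Viterbi decoding over fused per-frame, per-state log-scores
    T, S = emissions.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # predecessor scores
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = emissions[t] + np.max(scores, axis=0)
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(psi[t][path[-1]])
    return path[::-1]

# toy example: both streams agree that state 0 holds for two frames, then state 1
spec = np.array([[0., -5.], [0., -5.], [-5., 0.], [-5., 0.]])
loc = np.array([[0., -3.], [0., -3.], [-3., 0.], [-3., 0.]])
path = viterbi(combined_loglik(spec, loc), np.log(np.full((2, 2), 0.5)))
```

    The paper's framework additionally infers the number of states (speakers and locations) without supervision, which this fixed-state sketch does not attempt.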

    Linguistic and psychometric validation (cultural adaptation) of the PLUTSS scale for lower urinary tract dysfunction in Colombian children

    80% of children with recurrent urinary tract infection have some symptom of lower urinary tract dysfunction. According to the ICCS (International Children's Continence Society), these symptoms are classified by the phase of bladder function in which the alteration occurs: filling symptoms, voiding symptoms, and associated symptoms. Characterizing these symptoms objectively, so that they are not mere descriptive accounts of patient complaints and can be used for diagnosis and treatment monitoring, required scales that score each of them. These scales originate from the concept of the I-PSS (International Prostate Symptom Score), a highly useful tool for classifying prostatic hypertrophy. Today there are three validated tools for assessing lower urinary tract disorders in children; however, none of them has been translated into Spanish or culturally adapted to the Hispanic American population. The aim of this study is to carry out the cultural adaptation (linguistic and psychometric validation) of the PLUTSS scale, which is already validated and widely used, and to apply it to a group of Colombian children, thereby establishing the behavior of these symptoms in our population, so that the scale can be used as a diagnosis and follow-up tool in children with lower urinary tract disorders.
    METHODOLOGY: The PLUTSS (Pediatric Lower Urinary Tract Symptom Score) scale was translated into Spanish and adapted to Colombian usage according to accepted standards of translation, synthesis, back-translation, and expert recommendation. It was applied to a group of 34 patients with a clinical diagnosis of lower urinary tract disorder and 95 healthy controls. Face validation and construct validation were conducted, the internal consistency of the instrument was assessed, and the results were compared with those obtained with the original scale. RESULTS: The medians of the two groups (healthy and diseased) were significantly different, with a sensitivity and specificity of 90% at a cut-off point of 1.5. Internal consistency of the 13-item scale was high (Cronbach's alpha 0.88). The criterion validity of the scale against the clinical diagnosis showed a strong, significant correlation. CONCLUSIONS: The PLUTSS scale, linguistically and psychometrically validated under international scale-validation standards, is the only such scale adapted to Spanish. It showed a high correlation with the clinical diagnosis and a high power to discriminate urinary symptoms.

    A Spectrogram Model for Enhanced Source Localization and Noise-Robust ASR

    This paper proposes a simple, computationally efficient 2-mixture model approach to discrimination between speech and background noise. It is directly derived from observations on real data, and can be used in a fully unsupervised manner with the EM algorithm. A first application to sector-based, joint audio source localization and detection using multiple microphones confirms that the model can provide a major enhancement. A second application to the single-channel speech recognition task in a noisy environment yields a major improvement on stationary noise and promising results on non-stationary noise.